RE: x264 in or1ksim
by julius on Nov 30, 2009 |
julius
Posts: 363 Joined: Jul 1, 2008 Last seen: May 17, 2021 |
||
but the problem still exists.
I think I have spotted what is wrong. As I suspected, it's the linker script. The newlib toolchain's GCC uses a default linker script, or32-elf/lib/or32.ld, inside wherever you installed the toolchain. The default script, which gets installed when you compile and install GCC, only allocates the stack at 8MB, and I had forgotten that I had modified mine so that it sets the top of the stack at the 25MB mark.
I've attached my linker script. Put it in the or32-elf/lib directory of your toolchain (so if you installed it to /opt/or32-newlib, put this script in /opt/or32-newlib/or32-elf/lib). It sets the stack to start at the 25MB mark, minus 1MB for program code (this is approximate), leaving us the top 7MB for the video frame data (5MB for raw, 2MB for the output H.264 encoded data).
I think this definitely explains it. Sorry for not spotting it sooner.
Julius
or32.ld (1 kb)
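As a rough illustration of the layout described in the post, the following Python sketch works out where the regions would fall. It is only a sketch under stated assumptions: the 32MB total is inferred from "25MB stack top" plus "top 7MB", the ordering of the raw and encoded regions is a guess, and the 4:2:0 input format and the 30-frame CIF clip (mentioned later in the thread) are used only to check that 5MB is plausible for raw data. The attached or32.ld remains the authority.

```python
MB = 1024 * 1024

# Assumption: 32 MB of RAM in total, inferred from "stack top at 25MB" plus
# "top 7MB" in the post. This is not read from the attached or32.ld.
RAM_SIZE = 32 * MB
DEFAULT_STACK_TOP = 8 * MB    # default or32.ld shipped with the toolchain
MODIFIED_STACK_TOP = 25 * MB  # modified or32.ld described in the post
RAW_REGION = 5 * MB           # raw input frames
ENCODED_REGION = 2 * MB       # H.264 encoded output

# Assumption: raw frames sit directly above the stack top and the encoded
# output follows them. The post does not state the order.
raw_base = MODIFIED_STACK_TOP
encoded_base = raw_base + RAW_REGION
end_of_data = encoded_base + ENCODED_REGION


def cif_yuv420_bytes(frames):
    """Size in bytes of `frames` raw CIF (352x288) pictures, assuming 4:2:0."""
    luma = 352 * 288
    chroma = 2 * (176 * 144)  # two chroma planes at quarter resolution
    return frames * (luma + chroma)


print(f"raw frames : {raw_base:#010x} .. {encoded_base - 1:#010x}")
print(f"encoded out: {encoded_base:#010x} .. {end_of_data - 1:#010x}")
print(f"30 CIF frames need {cif_yuv420_bytes(30)} of {RAW_REGION} bytes")
```

With the default 8MB stack top, the same data regions at 25MB and above would sit far away from where the toolchain expects the stack, which is consistent with the problem going away once the modified script is used.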
|
RE: x264 in or1ksim
by kahomike on Dec 1, 2009 |
kahomike
Posts: 4 Joined: Aug 22, 2009 Last seen: Dec 14, 2009 |
||
With your new or32.ld, it works now, thanks a lot!
I am studying the feasibility of implementing an x264 encoder on an MPSoC using OR1200s. I measured the clock cycles used with several CIF sequences. 30 frames are encoded. The frame structure is IPPP..., with 1 reference frame.
Average number of clock cycles = 65,554,724,350
Assume one 100MHz processor is used. Encoding the 30 frames takes 65,554,724,350 / 100,000,000 = 655.5472435 seconds. That is on average 655.5472435 / 30 = 21.85 seconds per frame. Even assuming optimization can double the speed, it still needs around 11 seconds to encode 1 CIF frame.
I am thinking about this result. May I ask, do you have any opinion?
Regards, Mike
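The arithmetic in this post can be reproduced with a few lines of Python. The cycle count, clock rate and frame count are taken from the post; the per-macroblock figure and the 30 fps target are additions for illustration only (a CIF frame of 352x288 holds 22 x 18 = 396 macroblocks of 16x16), and the helper names are hypothetical.

```python
TOTAL_CYCLES = 65_554_724_350   # average over the CIF sequences (from the post)
CLOCK_HZ = 100_000_000          # one 100 MHz processor (from the post)
FRAMES = 30                     # frames encoded per sequence (from the post)
MACROBLOCKS_PER_FRAME = (352 // 16) * (288 // 16)  # 396 for CIF


def seconds_per_frame(speedup=1.0):
    """Encoding time per frame, optionally scaled by an assumed speedup."""
    return TOTAL_CYCLES / CLOCK_HZ / FRAMES / speedup


def required_speedup(target_fps):
    """Speedup needed to reach a target frame rate (hypothetical target)."""
    return seconds_per_frame() * target_fps


cycles_per_macroblock = TOTAL_CYCLES / FRAMES / MACROBLOCKS_PER_FRAME

print(f"seconds per frame        : {seconds_per_frame():.2f}")
print(f"with a 2x speedup        : {seconds_per_frame(2.0):.2f}")
print(f"cycles per macroblock    : {cycles_per_macroblock:,.0f}")
print(f"speedup needed for 30 fps: {required_speedup(30):.0f}x")
```

The per-macroblock number (on the order of a few million cycles) gives a sense of the budget any hardware assistance would have to cut into.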
RE: x264 in or1ksim
by julius on Dec 1, 2009 |
julius
Posts: 363 Joined: Jul 1, 2008 Last seen: May 17, 2021 |
||
Yes, these are the kinds of figures I'm getting in or1ksim, and I've recently put an FPU into ORPSoC so we can run the code at cycle-accurate level and get numbers of cycles used by each function.
It's going to be an interesting problem to solve. Identifying precisely where (which x264 functions) we can and should (in terms of numbers of cycles) implement some sort of speed-up is important. Another thing to consider is that x264 is very configurable, so perhaps we can turn off some of the features that might achieve greater compression or quality in favor of a reduction in complexity of the encoder.
Now that I can run the software in the cycle-accurate model, I will be looking at precise cycle numbers for the functions on a per-macroblock level to see which set of operations we should lend a hand to by way of a hardware module. Also, ORPSoC, as it stands, is far from a finely tuned high-performance system, so perhaps there is room to improve that, but I don't see it resulting in a 20x speedup right there.
I have a feeling modules with their own memories will greatly increase the rate of encoding, but it's easy to bite off more than you can chew when implementing a C algorithm as an FSM at RTL. This will, of course, need to be done, but the level at which we choose to do it is important. I think we'll probably want to do it above the level of, say, a simple SAD/SSD operation on a macroblock, but below the level of an entire ME algorithm (though perhaps near that kind of complexity). Thoughts?
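For readers unfamiliar with the SAD/SSD operations mentioned here, a minimal Python sketch follows. It shows the two block-difference metrics on small blocks given as lists of rows; it is plain illustrative code with hypothetical function names, not x264's own routines, and a real macroblock would be 16x16.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized pixel blocks."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


# Tiny 2x2 example blocks (made-up pixel values).
current = [[10, 20], [30, 40]]
reference = [[12, 18], [30, 45]]

print("SAD:", sad(current, reference))
print("SSD:", ssd(current, reference))
```

Each metric is a single pass over the block with no data-dependent control flow, which is why it sits at the low end of the complexity range being discussed for a hardware module.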
RE: x264 in or1ksim
by eejlny on Dec 1, 2009 |
eejlny
Posts: 3 Joined: Mar 2, 2006 Last seen: Aug 20, 2017 |
||
-- Below the level of an entire ME algorithm.
Why? One of the first candidates for acceleration is surely the ME functions, especially with the extra functionality of fractional pels, Lagrangian optimization, multiple reference frames, etc. This could reduce complexity by 50-70%. For example, x264 spends a lot of cycles not just during ME itself but also in the analyse.c functions, which need a lot of interpolations. Myself I have made available in opencores a configurable and programmable ME processor that can do these functions at high-definition levels of performance in low cost FPGAs. It is in VHDL, and the current interface only supports AMBA AHB and not Wishbone, so there is some work to be done there. Just have a look for the ME processor if you think this is suitable. Regards,
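To give a sense of why fractional-pel support implies many interpolations, here is a minimal Python sketch of H.264's luma half-sample filter applied along one row. It uses the standard 6-tap kernel (1, -5, 20, 20, -5, 1) with rounding and clipping to 8 bits; the function name and the one-dimensional framing are simplifications for illustration and do not reflect how x264 or the ME processor organize this work.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-sample filter taps


def half_pel_row(row):
    """Half-sample values between row[i + 2] and row[i + 3] for each
    position i where all six neighbouring integer samples exist."""
    out = []
    for i in range(len(row) - 5):
        acc = sum(t * p for t, p in zip(TAPS, row[i:i + 6]))
        value = (acc + 16) >> 5            # round and divide by 32
        out.append(max(0, min(255, value)))  # clip to the 8-bit range
    return out


# A flat area stays flat; a sharp edge lands roughly halfway.
print(half_pel_row([100] * 8))
print(half_pel_row([0, 0, 0, 255, 255, 255]))
```

Every half-sample position costs six multiplies-and-adds per output pixel, and quarter-sample positions are derived from these, so sub-pel refinement multiplies the work done per candidate motion vector.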
RE: x264 in or1ksim
by kahomike on Dec 2, 2009 |
kahomike
Posts: 4 Joined: Aug 22, 2009 Last seen: Dec 14, 2009 |
||
Yes, these are the kinds of figures I'm getting in or1ksim, and I've recently put an FPU into ORPSoC so we can run the code at cycle-accurate level and get numbers of cycles used by each function.
I am not familiar with this, but may I ask, will double precision floating point support of or1ksim improve the x264 performance? (because you mentioned that without single precision floating point stuff enabled the simulation is incredibly slow)
Another thing to consider is that x264 is very configurable, so perhaps we can turn off some of the features that might achieve greater compression or quality in favor of a reduction in complexity of the encoder.
I guess the ultrafast config of x264 already uses only the essential features of H.264. With more features disabled, I worry that the rate-distortion performance will be so low that it is only comparable to MPEG-2/4.
Regards, Mike
RE: x264 in or1ksim
by julius on Jan 15, 2010 |
julius
Posts: 363 Joined: Jul 1, 2008 Last seen: May 17, 2021 |
||
Myself I have made available in opencores a configurable and programmable ME processor that can do these functions at high-definition levels of performance in low cost FPGAs.
This project looks very good. Am I right in thinking you too are using x264 as a base to run this on/with? I notice on the project's home page you built a compiler too which looks like it converts some macros into code which configures and runs this ME processor. This is a good idea, as it provides flexibility. |
RE: x264 in or1ksim
by julius on Jan 15, 2010 |
julius
Posts: 363 Joined: Jul 1, 2008 Last seen: May 17, 2021 |
||
I am not familiar with this, but may I ask, will double precision floating point support of or1ksim improve the x264 performance? (because you mentioned that without single precision floating point stuff enabled the simulation is incredibly slow)
Yes, this helped a little bit, although it wasn't crucial. In hindsight it's not really needed, and most of the floating point work is double precision anyway, which has to be done in software because the current or1200 has 32-bit registers.
I guess the ultrafast config of x264 already uses only the essential features of H.264. With more features disabled, I worry that the rate-distortion performance will be so low that it is only comparable to MPEG-2/4.
Sure, one of the defining features of H.264 is the greater choice of options taking advantage of spatial and temporal redundancies, all of which require greater complexity in the encoder. To ignore these would be to miss the point of the exercise, so hopefully we can build something which does at least some of the methods unique to H.264/AVC, and eventually most or all of them. But to begin with, it'd be easiest to target something which runs with the simpler ME methods, but keeping in mind we'll eventually want the ability to use smaller sub-block sizes and more complex search areas. |
RE: x264 in or1ksim
by eejlny on Jan 31, 2010 |
eejlny
Posts: 3 Joined: Mar 2, 2006 Last seen: Aug 20, 2017 |
||
Myself I have made available in opencores a configurable and programmable ME processor that can do these functions at high-definition levels of performance in low cost FPGAs.
This project looks very good. Am I right in thinking you too are using x264 as a base to run this on/with? I notice on the project's home page you built a compiler too which looks like it converts some macros into code which configures and runs this ME processor. This is a good idea, as it provides flexibility.
Yes, it uses x264 and replaces the motion estimation functions with the core. The compiler is quite powerful and can implement complex fast block-matching algorithms such as UMH and PMVFAST, and of course classical hexagonal or diamond search. There are basically infinite ways of programming and configuring the core.
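As an illustration of the kind of fast block-matching search named here, the following Python sketch implements a simplified small-diamond search driven by SAD. It is a plain software sketch with hypothetical names and made-up test data, not the ME processor's programming model and not x264's implementation; the classical diamond search also uses a larger pattern for the coarse steps, which is omitted for brevity.

```python
SMALL_DIAMOND = [(0, -1), (-1, 0), (1, 0), (0, 1)]  # (dx, dy) offsets


def sad_at(cur, ref, bx, by, dx, dy, size):
    """SAD between the block at (bx, by) in cur and the block displaced
    by (dx, dy) in ref."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def in_bounds(ref, bx, by, dx, dy, size):
    """True if the displaced block lies entirely inside ref."""
    height, width = len(ref), len(ref[0])
    return (0 <= bx + dx and bx + dx + size <= width
            and 0 <= by + dy and by + dy + size <= height)


def small_diamond_search(cur, ref, bx, by, size=4, max_steps=16):
    """Greedy search: step to the best diamond neighbour until none improves."""
    best, best_cost = (0, 0), sad_at(cur, ref, bx, by, 0, 0, size)
    for _ in range(max_steps):
        cx, cy = best
        improved = False
        for ox, oy in SMALL_DIAMOND:
            dx, dy = cx + ox, cy + oy
            if not in_bounds(ref, bx, by, dx, dy, size):
                continue
            cost = sad_at(cur, ref, bx, by, dx, dy, size)
            if cost < best_cost:
                best, best_cost, improved = (dx, dy), cost, True
        if not improved:
            break
    return best, best_cost


# Made-up smooth reference frame; the current frame is the reference
# shifted by (2, 1), so the expected motion vector is (2, 1).
ref_frame = [[x * x + y * y for x in range(16)] for y in range(16)]
cur_frame = [[ref_frame[y + 1][x + 2] for x in range(12)] for y in range(12)]

mv, cost = small_diamond_search(cur_frame, ref_frame, bx=4, by=4)
print("motion vector:", mv, "cost:", cost)
```

The search evaluates only a handful of candidates per step instead of every position in the window, which is the trade-off fast algorithms such as hexagonal or diamond search make against exhaustive search.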
RE: x264 in or1ksim
by vrpatil on Feb 28, 2010 |
vrpatil
Posts: 10 Joined: Jun 24, 2009 Last seen: Jun 22, 2016 |
||
I tried to compile x264 as described in the first post, but I got the error below from the configure script:
./configure --disable-avis-input --disable-mp4-output --disable-pthread --enable-debug --host=or32-linux --cross-prefix=or32-elf- --extra-cflags="-g -mhard-mul -mhard-div -mhard-float" --extra-ldflags="-Tlink.ld"
/opt/OR1x00/or32-newlib/lib/gcc/or32-elf/4.2.2/../../../../or32-elf/bin/ld: crt0.o: No such file: No such file or directory
collect2: ld returned 1 exit status
No working C compiler found.
I have built binutils and gcc as mentioned.